Abstract

We consider the problem of analyzing social network data sets in which the edges of the 
network have timestamps, and we wish to analyze the subgraphs formed from edges in contiguous 
subintervals of these timestamps. We provide data structures for these problems that use near-linear
preprocessing time, linear space, and sublogarithmic query time to handle queries that ask
for the number of connected components, number of components that contain cycles, number 
of vertices whose degree equals or is at most some predetermined value, number of vertices that 
can be reached from a starting set of vertices by time-increasing paths, and related queries. 



1 Introduction 



The study of algorithms for social network analysis has so far been concentrated primarily on 
computations involving graphs that are relatively static: either a fixed graph is given as input to 
algorithms for problems such as the computation of centrality [2, 14, 25, 32], or the input graph is
assumed to change gradually by insertions and deletions of vertices and edges, and these changes
can be handled efficiently by dynamic graph algorithms [11, 13]. These input models work well for
networks that describe long-term ties such as friendship or supervisorial relations between people, 
and they often match the information provided by online service providers such as Facebook. 
However, there is a second type of social network data set, known variously as relational event 
data [5], dyadic event data [3], longitudinal network data [19, 31], contact sequences [23], or
time-ordered networks [29], on which we would also like to perform efficient computations. This type of
data models communication events between pairs of people rather than long-term ties; for instance,
each datum in a data set might consist of the identities of the sender and recipient of a single email
message along with its timestamp.¹ At any instant of sampled time there is no graph, only a single
edge.

It is only by grouping together multiple events over sliding windows of time that we can form 
a network from this type of relational event data [8, 26, 30]. If a fixed window size is chosen, the
sequence of time windows can be modeled by the insertion of an edge when it enters the window, and
deletion when it leaves the window; these types of change are familiar in the analysis of algorithms
as the basis for many dynamic graph algorithms [11]. However, the dynamic graph model is too
restrictive for our purposes, because it may not be obvious what length of time window to use. 
Long windows present a relatively static view of the data, aggregating it into a single graph but 
losing all dynamic time information. Short windows capture the dynamics better but may be too 
sparse to see the entire pattern of connections; indeed, for very short windows, the subset of data 
within any window may have few or no vertices with degree higher than one. Thus, it may be useful 
to perform exploratory data analysis by testing different sizes of window to find the one that best 
balances connectivity with dynamics, or to study the same data set with multiple window sizes to 
show how its behavior varies with the time scale. 

In this paper we provide for the first time an algorithmic model for the analysis of relational 
event data with windows chosen dynamically rather than a priori, and we also develop fundamental 
data structures that can perform this analysis efficiently. In our model, the input to a relational 
event data analysis problem is formulated as a sequence of (directed or undirected) edges, and we 
define a slice of the data to be the graph formed by a contiguous subsequence of the input. The 
data structures that we describe represent the entire relational data set using only linear space, 
can be constructed in near-linear time, and support queries that ask for statistical information 
about arbitrary slices of the data in sublogarithmic time per query. The graph properties that our 
data structures can handle include many quantities already studied (for static networks) by social 
networking researchers, including the following: 

• We can count the number of connected components of a slice, the number of nontrivial 
components (with more than one vertex), the number of loopy components (components 
that contain at least one cycle), the average size of a connected component, the average 
size of a nontrivial connected component, and the number of loopy edges (edges that close 
a cycle). Such queries concern the connectivity structure of the network, providing insight 
into diffusion processes and the robustness of these processes to intervention. In sexual
networks, for example, a scarcity of short cycles implies the absence of a dense core, a fact
with implications for the study and treatment of HIV [1, 35] and gonorrhoea [9, 34]. In
such networks loopy edges represent reinfection events, which fuel the growth phase of an
outbreak [34].

¹ For example, Kossinets and Watts describe an email data set of this type with approximately 7 million email
messages, sent by approximately 30,000 people within a single university campus over the course of a year [26].



• We can count the number of isolated vertices in a slice, the number of isolated edges, the
number of edges that have d neighboring edges for any constant d, and the number of vertices
that have exactly or at most d neighboring edges or vertices for any predetermined (but
not necessarily constant) d. These parameters provide access to the degree distribution of a
network, which has long been recognized as important in social network analysis; for instance,
Seidman [37] emphasizes the importance of distinguishing networks with uniform
degrees from networks in which there are a few high degree vertices and many low degree
vertices.

• We can count the number of repeated edges within a slice, the number of edges with multiplicity
exactly or at most μ, and the reciprocity (relative proportion of pairs of vertices
connected by directed edges in both directions to pairs connected only in one direction). In
social networks where directed edges represent communication, reciprocity has been used to
quantify the amount of social interaction versus broadcast communication [18, 20].



• We can count the number of vertices that can be reached from a predetermined set of starting 
vertices, via (directed or undirected) paths in which the timestamps of the edges are monotonically
increasing. These paths have also been called journeys [4] or diffusion paths [29];
they are the possible transmission routes of information or contagion through the network,
and the number of reachable nodes has been studied (for the aggregate graph rather than for
slices) by Holme [23]. We can also count the number of vertices that can be reached by paths
of this type with bounded hop count. 

• We can count the number of triad closure events in which an edge belongs to at least one 
triangle formed by it and earlier edges within the slice. Triads and related transitivity properties
such as the clustering coefficient have long been recognized as important in the structure
of social networks [21, 36], and the number of triad closure events, like monotonic reachability,
also incorporates time-dependence in its definition. Our data structure for counting them
involves somewhat slower preprocessing, comparable to the time to find or count triangles in 
a static graph. 



We show how to reduce each of these problems to two-dimensional dominance counting queries on
point sets derived from the input network, with $O(1)$ points per network edge. In the variant of
two-dimensional dominance counting that we use, a data set consists of a set of $n$ two-dimensional
points with integer coordinates $(x, y)$ satisfying $1 \le y \le x \le n$, and each query must determine
the number of points dominated by a query point $q = (x_0, y_0)$, i.e., the number of points $(x, y)$
with $x \le x_0$ and $y \le y_0$. Through a slight abuse of terminology we will use the term dominance
counting when counting the number of points contained in any one of the four quadrants created by
placing the query point $q$ in the plane. As we describe, many of the problems listed above can be
reduced to dominance counting through a common unification in terms of the rank function of
matroids, generalizing a data transformation for one-dimensional colored range counting given by
Gupta et al. [22]. For the remaining problems (including isolated edges and monotonic reachability)
our reduction instead passes through another two-dimensional range searching problem, rectangle
stabbing. Combining our reductions with known dominance counting data structures [24] would allow
us to solve these problems in linear space and query time $O(\log m / \log\log m)$, where $m$ is the
number of edges in the input network; in an appendix we provide a refined dominance counting
structure that improves the query time to $O(\log w / \log\log m)$, where $w$ is the number of edges
in the queried slice. We also outline an alternative solution based on path-copying persistence [10]
and balanced binary trees that we expect to be more suitable for implementation; it uses
$O(m \log m)$ space and gives query time $O(\log w)$.

Figure 1: Example dominance query where the points in the shaded region are counted.

In another appendix, we adapt a lower bound of Mihai Pătraşcu for two-dimensional range
counting [33] to the problems studied here, using reductions in the other direction from point sets
to networks. We show that, in the cell probe model, with query time measured as a function only
of the input size (rather than of the window size, as in our upper bounds) all data structures that
use space $O(m \log^{O(1)} m)$ must take $\Omega(\log m / \log\log m)$ query time to solve many of the queries
considered here.



2 Problem formulation 

Define a relational event graph $G$ to be a fixed set of vertices $V$ together with a sequence of edges
(or relational events) $E = \{e_k \mid 0 \le k < m\}$ between pairs of vertices. The graph is undirected
if the pairs are unordered, and directed if the pairs are ordered. The pairs in the sequence are
not required to be distinct from each other. Given a relational event graph $G$ we define the slice
multigraph $G_{i,j}$ to be the multigraph with vertices $V$ and edges $\{e_k \mid i \le k \le j\}$.

We assume that the entire relational event graph $G$ is given to us as input. Our task is to
construct a data structure from $G$ that will allow us to compute the properties of its slices $G_{i,j}$,
for a query pair of indices $(i, j)$. To avoid trivial solutions, queries in such a data structure should
take less time than the $\Omega(j - i)$ of an algorithm that constructs the slice multigraph and applies
a static graph algorithm to it, and the data structure should use less space than the $\Omega(m^2)$ of an
algorithm that precomputes and stores the answers to all possible queries.



3 Matroid rank 



We will turn many of our queries into a matroid rank problem. Recall [27, 40] that a matroid over
a set $S$ is a collection of subsets of $S$ called independent sets obeying the following three properties:

• The empty set is independent. 

• Every subset of an independent set is independent. 

• If $U$ and $V$ are independent sets and $U$ is larger than $V$, then there exists $u \in U$ such that
$V \cup \{u\}$ is independent.

The rank of a set $E \subseteq S$ is the size of the largest independent subset of $E$, and a circuit is a
minimal dependent subset (i.e., a dependent set whose proper subsets are all independent).

We will define matroids over sequences $E = \{e_k \mid 0 \le k < m\}$ of elements (usually the edges
of our input graph); we form slices $E_{i,j} = \{e_k \mid i \le k \le j\}$ from contiguous subsequences of this
sequence. The rank of $E_{i,j}$ will then be useful for computing the numerical graph quantities we
wish to compute; for instance, we will use the graphic matroid (whose rank is the number of edges
in a spanning forest) to determine the numbers of connected components and loopy edges in a slice.

In order to calculate the rank of a slice, we define the independence time $\tau(e_k)$ for an element
$e_k$ to be the smallest index $i$ such that $\mathrm{rank}(E_{i,k}) > \mathrm{rank}(E_{i,k-1})$, or $-1$ if the inequality holds for all
$i$. A straightforward induction on $j - i$ shows that the rank of every slice $E_{i,j}$ equals the number
of matroid elements in the slice whose independence time is before the interior of the slice, that is,
the number of elements $e_k$ with $i \le k \le j$ and $\tau(e_k) \le i$.

Our meta-algorithm for precomputing $\tau$ (needing details to be filled in for specific matroids)
assigns a weight to each element, equal to its index. It then incrementally considers the elements
in sequence order, maintaining as it does a maximum-weight basis of the set of elements considered
so far. When adding element $e_k$ to the basis would leave it independent, the augmented
set becomes the new basis and in this case we set $\tau(e_k) = -1$. However, when the previous basis
and the new element together contain a circuit (necessarily a unique circuit), we form the new basis
by removing the lightest element from this circuit, and adding $e_k$ in its place; in this case, $\tau(e_k)$ is
one more than the index of the removed element. Later, when we discuss specific matroids, we will
describe how to quickly identify the circuit containing the new element and the lightest element of
this circuit.

Lemma 1. The elements of $E$ can be mapped to points in $\mathbb{R}^2$ such that the rank of $E_{i,j}$ can be
determined by a dominance counting query. The time needed for this mapping is the same as the
time to compute $\tau$ for all elements in $E$.

Proof. We map each $e_k$ to $(k, \tau(e_k))$. To determine the rank of $E_{i,j}$ we count the number of elements
whose independence time is in the slice's range of indices, i.e., $i \le k \le j$ and $i < \tau(e_k)$. This three
sided query can be reduced to the dominance counting query $k \le j$ and $i < \tau(e_k)$, as it is not
possible to have $k < \tau(e_k)$. We then take the complement to count the edges whose independence
time is before the interval. □
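To make the query in Lemma 1 concrete, the following Python sketch evaluates the rank of a slice from a list of precomputed independence times. It is only an illustration under stated assumptions: the name `slice_rank` and the list `tau` (with `tau[k]` holding the independence time of element k) are hypothetical, and a linear scan stands in for a real dominance counting data structure.

```python
def slice_rank(tau, i, j):
    """Rank of the slice of elements i..j (inclusive), given independence times.

    Assumption: tau[k] is the independence time of element k, with tau[k] <= k.
    A linear scan replaces the dominance counting structure for clarity.
    """
    # Points (k, tau[k]) with k <= j and tau[k] > i are the elements that do
    # not contribute to the rank. Because tau[k] <= k, the condition
    # tau[k] > i already forces k > i, so no separate test on i is needed.
    non_contributing = sum(1 for k, t in enumerate(tau) if k <= j and t > i)
    # Complement within the slice, which holds j - i + 1 elements.
    return (j - i + 1) - non_contributing


# Two parallel edges in a graphic matroid: the second closes a circuit with
# element 0, so its independence time is 0 + 1 = 1.
tau_example = [-1, 1]
print(slice_rank(tau_example, 0, 1))
print(slice_rank(tau_example, 1, 1))
```

In the example, the slice containing both parallel edges has rank 1, and so does the slice containing only the second edge, because the edge it conflicts with lies outside that slice.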



4 Counting vertices by degree and edges by multiplicity 

In this section we use partition matroids (a standard type of matroid, defined below) to determine 
the number of vertices of bounded degree, the number of vertices of a specific degree, the number 
of edges of bounded multiplicity, the number of edges of a given multiplicity, and the reciprocity 
of a slice in a relational event graph. Our techniques can also solve the colored range counting 
problem considered in [22], using colors to define the partition, and our Lemma 2 generalizes their
data transformation approach for colored range counting.

In general, a partition matroid is defined over a set $S$ that has been partitioned into a family
of disjoint subsets $P_i$ for $1 \le i \le p$, each of which is associated with a numeric parameter $k_i$. A
subset $A$ of $S$ is defined to be independent in the matroid if $A \cap P_i$ has at most $k_i$ elements for
each $1 \le i \le p$. Each circuit of this matroid is a subset of exactly $k_i + 1$ elements of one of the
sets $P_i$. It is straightforward to verify that the independence system defined in this way satisfies
the axioms of a matroid [40].

Given a relational event graph $G$ with edge sequence $E = \{e_i\}$ and a parameter $k$ we consider
a partition matroid whose elements are the set of half-edges $(u_i, 2i)$ and $(v_i, 2i + 1)$ for each edge
$e_i = \{u_i, v_i\}$ in $E$. In this matroid we define a set $S$ to be independent if each vertex of $G$ appears
at most $k$ times in the first component of a half-edge, where $k$ is a fixed parameter. That is,
there is one set $P_u$ for each vertex $u$, containing all the half-edges whose first component is $u$, and
the corresponding partition matroid parameter is $k_u = k$. Because each vertex contributes at most $k$
half-edges to an independent set, the rank of this partition matroid is given by
$\mathrm{rank}_k(S) = \sum_{u \in V} \min(\deg(u), k)$.

Lemma 2. We can compute $\tau$ for the partition matroid described above in linear time and space.

Proof. We process the half-edges in index order, adding them to a dictionary that associates each
vertex with a $k$-element queue of insertion times. When inserting a new half-edge, if its endpoint's
queue is full, we dequeue the oldest half-edge and record one more than its index for $\tau$ of the current
half-edge; otherwise we record $-1$ for $\tau$ of the half-edge. Then, regardless of whether the queue was
full, we add the half-edge to the queue. By storing the queues as linked lists this can be done in
linear space and time. □
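The queue-based procedure in this proof can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `partition_tau` is hypothetical, edges are assumed to be given as pairs of hashable vertex labels, `k` is assumed to be at least 1, and `collections.deque` plays the role of the linked-list queues.

```python
from collections import defaultdict, deque


def partition_tau(edges, k):
    """Independence times of the half-edges for the degree-bound partition matroid.

    Edge i = (u, v) yields half-edge 2*i at u and half-edge 2*i + 1 at v.
    Assumption: k >= 1.
    """
    queues = defaultdict(deque)  # vertex -> indices of its k most recent half-edges
    tau = []
    for i, (u, v) in enumerate(edges):
        for vertex, index in ((u, 2 * i), (v, 2 * i + 1)):
            queue = queues[vertex]
            if len(queue) == k:
                # Queue full: the oldest half-edge is the lightest element of
                # the circuit, so record one more than its index.
                tau.append(queue.popleft() + 1)
            else:
                tau.append(-1)
            # The new half-edge always joins the queue.
            queue.append(index)
    return tau


print(partition_tau([("a", "b"), ("a", "c"), ("a", "b")], 1))
```

Each half-edge is enqueued once and dequeued at most once, which is where the linear time bound comes from.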

Theorem 1. Given a relational event graph $G$ the problems of determining the number of isolated
vertices, vertices of a given degree $d$, and vertices of bounded degree in $G_{i,j}$ can be reduced to
dominance counting in linear time and space.

Proof. We use our solution to the matroid rank problem (Lemma 1) to create two data structures
for the partition matroid with $k = d$ and with $k = d + 1$. Then by performing two dominance
counting queries we can compute

$$\mathrm{rank}_{d+1}(G_{i,j}) - \mathrm{rank}_d(G_{i,j}) = \sum_{v \in V} \bigl( \min(\deg(v), d + 1) - \min(\deg(v), d) \bigr),$$

which is equal to the number of vertices of degree greater than $d$, because each summand is 1 when
$\deg(v) > d$ and 0 otherwise. With the ability to count the
number of vertices of degree greater than $d$ we can easily compute the queries stated in the theorem
by inclusion-exclusion. □
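The rank difference used in this proof can be checked on small inputs with the self-contained Python sketch below. All names (`partition_tau`, `slice_rank`, `count_degree_greater`) are hypothetical, the independence times are recomputed on every call for brevity rather than preprocessed once, and a linear scan stands in for the dominance counting structure. The edge slice from i to j corresponds to the half-edge slice from 2i to 2j + 1.

```python
from collections import defaultdict, deque


def partition_tau(edges, k):
    """Half-edge independence times for degree bound k (assumes k >= 1)."""
    queues = defaultdict(deque)
    tau = []
    for i, (u, v) in enumerate(edges):
        for vertex, index in ((u, 2 * i), (v, 2 * i + 1)):
            queue = queues[vertex]
            tau.append(queue.popleft() + 1 if len(queue) == k else -1)
            queue.append(index)
    return tau


def slice_rank(tau, lo, hi):
    """Rank of elements lo..hi: slice size minus elements with tau > lo."""
    return (hi - lo + 1) - sum(1 for k, t in enumerate(tau) if k <= hi and t > lo)


def count_degree_greater(edges, d, i, j):
    """Number of vertices with degree greater than d in the edge slice i..j."""
    lo, hi = 2 * i, 2 * j + 1  # half-edge indices covered by the edge slice
    upper = slice_rank(partition_tau(edges, d + 1), lo, hi)
    # The matroid with bound 0 has rank 0, so no structure is needed for d = 0.
    lower = slice_rank(partition_tau(edges, d), lo, hi) if d > 0 else 0
    return upper - lower


edges = [("a", "b"), ("a", "c"), ("a", "b"), ("c", "d")]
print(count_degree_greater(edges, 1, 0, 3))
```

In the full slice the degrees are 3, 2, 2 and 1, so three vertices have degree greater than 1. The count of vertices of degree exactly d then follows by subtracting the result for d from the result for d - 1.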

We define the multiplicity of an edge in a directed or undirected relational event graph to be 
the number of other edges that have the same two endpoints (as an ordered or unordered pair, 
respectively). We can count the distinct edges, the edges that have a given multiplicity, or the 
edges that have bounded multiplicity, using a partition matroid whose elements are the edges, and 
whose partitions group together edges that have the same ordered or unordered pair of endpoints. 

Theorem 2. Given a (directed) relational event graph $G$ the problems of determining the number
of distinct directed or undirected edges, edges of bounded multiplicity, reciprocated edges, and the
reciprocity in $G_{i,j}$ can be reduced to dominance counting in linear time and space.

Proof. The number of edges with given multiplicity can be counted using ranks in a partition
matroid, as in Theorem 1.

To determine the number of reciprocated edges, we count the number of distinct edges in two
different ways, interpreting the same graph once as a directed graph and a second time as an
undirected graph. Reciprocated edges are counted twice as distinct directed edges but only once
as undirected, and unreciprocated edges are counted once either way, so the number of reciprocated
edges is the difference between the numbers of distinct directed and undirected edges. The
reciprocity is then the ratio of reciprocated edges to all edges. □
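The counting identity in this proof can be illustrated directly. The Python sketch below is a brute-force check, not the data structure itself: the function name `reciprocated_pairs` is hypothetical, and Python sets replace the two partition matroid rank queries (each with parameter 1) that would supply the two distinct-edge counts.

```python
def reciprocated_pairs(edges, i, j):
    """Number of vertex pairs joined in both directions within the slice i..j.

    Assumption: edges is a list of directed (source, target) pairs.
    """
    window = edges[i:j + 1]
    # Distinct edges when direction matters.
    distinct_directed = {(u, v) for u, v in window}
    # Distinct edges when direction is ignored.
    distinct_undirected = {frozenset((u, v)) for u, v in window}
    # A reciprocated pair is counted twice above and once below.
    return len(distinct_directed) - len(distinct_undirected)


edges = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")]
print(reciprocated_pairs(edges, 0, 3))
print(reciprocated_pairs(edges, 2, 3))
```

In the full slice only the pair a, b is connected in both directions; the shorter slice contains no reply from b, so its count is zero.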



5 Counting connected components 

To count the number of connected components in a graph we use the graphic matroid [40], another
standard type of matroid. The graphic matroid of an undirected graph G has the edges of G as its 
elements; a set of edges is independent in the graphic matroid if it forms a forest. A circuit in the 
graphic matroid is a simple cycle in G, and the rank of the graphic matroid on a set of edges is the 
number of vertices minus the number of connected components. 

To count connected components that contain cycles we use the bicycle matroid (or bicircular
matroid) [28], a somewhat less well-known matroid that also has the edges of a graph as its elements.
In the bicycle matroid, a set of edges is independent if it forms a pseudoforest, a graph that has
at most one cycle per connected component or equivalently a graph in which each subgraph has at 
most as many edges as vertices. The rank of the bicycle matroid on a set of edges is the number of 
vertices minus the number of tree components. 



To precompute $\tau$ for both of these matroids, we use linking and cutting trees [38, 39]. These are
data structures that may be used to represent a rooted forest, subject to updates that either insert
edges (if the result of the insertion would still be a forest) or delete them. Linking and cutting trees
also allow operations to look up the root of the tree containing a query vertex or the lightest edge
on any path. Both updates and queries take logarithmic time per operation, and the overall data
structure uses linear space.

Lemma 3. We can precompute $\tau$ for the graphic matroid in $O(m \log n)$ time and linear space.

Proof. We store the vertices of $G$ into a linking and cutting tree, and then process the edges in
increasing order. When adding an edge $e_i = \{u_i, v_i\}$ to the forest we check if it creates a cycle. If
so, we find the lightest edge on the path from $u_i$ to $v_i$, record one more than its index as $\tau(e_i)$,
remove the lightest edge from the forest, and add $e_i$ to the forest. If adding $e_i$ does not create a cycle,
then we add it to the forest and record $-1$ for $\tau(e_i)$. For each edge we do $O(\log n)$ work, for a total
processing time of $O(m \log n)$. □

Theorem 3. Given a relational event graph $G$ the problems of determining the number of connected
components, nontrivial connected components, average size of a connected component, average size
of a nontrivial connected component, and the number of loopy edges in $G_{i,j}$ can be reduced to
dominance counting in $O(m \log n)$ time and linear space.

Proof. The number of connected components follows from the matroid rank problem (Lemma 1)
together with Lemma 3. For nontrivial components we use Theorem 1 to count isolated vertices
and subtract this value from the number of connected components. The average component sizes can then be
computed easily. To count the number of loopy edges we observe that the number of loopy edges
equals the total number of edges minus the number of edges in a spanning forest, i.e., it is the
number of edges in the slice minus the graphic matroid rank of the slice. □
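The preprocessing of Lemma 3 and the component count of Theorem 3 can be sketched in Python as below. This sketch deliberately swaps in a simpler technique: a plain adjacency dictionary with a depth-first path search replaces the linking and cutting tree, so each edge costs time linear in the forest size rather than logarithmic. The names `forest_path_min`, `graphic_tau`, and `count_components` are hypothetical, and the input is assumed to contain no self-loops.

```python
def forest_path_min(adj, u, v):
    """Smallest edge index on the forest path from u to v, or None if none exists."""
    parent = {u: None}
    stack = [u]
    while stack:
        x = stack.pop()
        if x == v:
            break
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    if v not in parent:
        return None
    lightest, x = None, v
    while parent[x] is not None:  # walk back from v to u
        index = adj[x][parent[x]]
        lightest = index if lightest is None else min(lightest, index)
        x = parent[x]
    return lightest


def graphic_tau(vertices, edges):
    """Independence times for the graphic matroid (assumes no self-loops)."""
    adj = {x: {} for x in vertices}  # adj[a][b] = index of forest edge {a, b}
    tau = []
    for k, (u, v) in enumerate(edges):
        lightest = forest_path_min(adj, u, v)
        if lightest is None:
            tau.append(-1)  # the new edge joins two trees
        else:
            a, b = edges[lightest]  # evict the lightest edge of the cycle
            del adj[a][b]
            del adj[b][a]
            tau.append(lightest + 1)
        adj[u][v] = k
        adj[v][u] = k
    return tau


def count_components(vertices, tau, i, j):
    """Connected components of the slice i..j: vertices minus graphic rank."""
    rank = (j - i + 1) - sum(1 for k in range(i, j + 1) if tau[k] > i)
    return len(vertices) - rank


vertices = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0), (0, 1)]
tau = graphic_tau(vertices, edges)
print(tau, count_components(vertices, tau, 0, 3))
```

The number of loopy edges in a slice is obtained from the same rank value, as the number of edges in the slice minus the rank.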

When computing $\tau$ for the bicycle matroid we will need to dynamically maintain a pseudoforest.
To do this we augment the linking and cutting tree with a dictionary whose keys are the tree roots 
and whose associated values are the lightest edges in the cycles of the corresponding pseudotrees (or 
null for tree components). Thus, the linking and cutting tree always stores the maximum spanning 
forest, and the dictionary holds the missing edges of each pseudotree. 

Lemma 4. We can precompute $\tau$ for the bicycle matroid in $O(m \log n)$ time and linear space.

Proof. When adding an edge $e_k = \{u_k, v_k\}$ we consider five possible cases: (1) two trees are joined,
(2) a cycle is created in a tree, (3) a tree and a pseudotree are joined, (4) a second cycle is formed
in a pseudotree, (5) two pseudotrees are joined.

In cases (1), (2) and (3) we are left with a pseudoforest so we record $-1$ for $\tau(e_k)$. In case (2)
we remove the lightest edge on the cycle formed by $e_k$ from the linking and cutting tree, and place
it in the dictionary; in all cases we add $e_k$ to the linking and cutting tree. In case (3) we update
the key for the lightest edge in the pseudotree with its new root, if the root changes.

In cases (4) and (5) we find and discard the lightest edge in the union of the two cycles and
the path (if it exists) between them, either returning to a component with one cycle or splitting
it into two components each with a cycle. We update the linking and cutting tree and dictionary,
and record one more than the index of the discarded edge as $\tau(e_k)$.

Since we only added $O(n)$ space and $O(m \log n)$ time to the procedure in Lemma 3 we have the
same space and time bounds. □

Theorem 4. Given a relational event graph $G$ the problems of determining the number of loopy
components, tree components, and nontrivial tree components in $G_{i,j}$ can be reduced to dominance
counting in $O(m \log n)$ time and linear space.

Proof. The number of trees follows from the matroid rank problem (Lemma 1) and Lemma 4. For
loopy components we also build the data structure in Theorem 3 and subtract the number of trees
from the number of connected components. For nontrivial trees we build the data structure in
Theorem 1 and subtract the number of isolated vertices from the number of trees. □



6 Counting edge neighbors 

In this section we compute the number of edges that have a given or bounded number of neighboring 
edges. This does not seem to be an instance of the matroid rank problem. Instead, we reduce it to 
a rectangle stabbing problem. 

For an edge $e_k$ we define a past neighbor to be an edge $e_i$ sharing at least one vertex with $e_k$
and having $i < k$, and we define a future neighbor to be an edge $e_j$ sharing at least one vertex with
$e_k$ and having $k < j$. Let $\pi_r(e_k)$ denote the least $i$ such that $e_k$ has $r$ neighbors in $G_{i,k}$,
and let $\varphi_s(e_k)$ denote the greatest $j$ such that $e_k$ has $s$ neighbors in $G_{k,j}$.

Lemma 5. We can precompute $\pi_r$ and $\varphi_s$ in $O((r + s)m)$ time and linear space.

Proof. First we precompute $\pi_r$ by processing the edges (as half-edges) in index order, adding them
to a dictionary structure, as in Lemma 2, indexed by the vertex and storing the insertion times in
$r$-sized queues. When inserting a new edge $e_k = \{u_k, v_k\}$ we consider the queues for both $u_k$ and
$v_k$. If the sum of the sizes of the two queues is less than $r$ then we record $-1$ for $\pi_r(e_k)$. Otherwise
we iterate through the queues to find the least $i$ such that there are exactly $r$ neighbors with index
greater than $i$ and record this as $\pi_r(e_k)$. Finally, we add the two half-edges to their respective
queues, dropping the oldest half-edge if the queues overflow. This process takes $O(rm)$ time,
with $O(r)$ time per edge to iterate through the queues, and $O(n + m)$ space. To compute $\varphi_s$ we
repeat the process in reverse, which takes $O(sm)$ time and $O(n + m)$ space. □

Lemma 6. Given a set of rectangles, the problem of determining which rectangles are stabbed by 
(enclosing) a query point can be reduced to a constant number of dominance counting queries. 

Proof. The number of rectangles stabbed by a point $q$ can be reduced to a linear combination of the
counts of rectangle corners belonging to six different combinations of corner type and apex-$q$
quadrant, using an inclusion-exclusion relation that seems to be folklore. Figure 2 provides an
illustration: if we add $+1$ for each rectangle whose geometric relationship to $q$ is indicated by the
blue L-shapes, and $-1$ for each rectangle whose relation to $q$ is indicated by the red L-shapes, then
each rectangle containing $q$ adds a total of $+1$ to this sum (only for its lower left corner) while each
other rectangle adds zero (either with two corners that cancel each other, or no corners). Therefore
the total sum equals the number of rectangles stabbed by $q$.

Therefore, to answer rectangle stabbing queries, we may build three (signed) dominance counting
data structures, one for each of the nonempty quadrants in the figure, giving us the contributions
from each quadrant. □

Figure 2: Turning stabbing into dominance.
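One way to see that stabbing counts reduce to dominance-style counts is the Python sketch below. It uses a different but equivalent inclusion-exclusion from the six-term relation of Figure 2: it subtracts the rectangles that miss the query point on each side and adds back the ones subtracted twice. The function name `stabbing_count` and the tuple layout `(x1, x2, y1, y2)` for closed rectangles are assumptions for illustration, and each count is a linear scan where a dominance counting structure on one corner type would be used in practice.

```python
def stabbing_count(rects, qx, qy):
    """Number of closed rectangles (x1, x2, y1, y2) that contain the point (qx, qy)."""
    total = len(rects)
    # One-sided misses: each is a one-dimensional count on a single coordinate.
    left = sum(1 for x1, x2, y1, y2 in rects if x2 < qx)
    right = sum(1 for x1, x2, y1, y2 in rects if x1 > qx)
    below = sum(1 for x1, x2, y1, y2 in rects if y2 < qy)
    above = sum(1 for x1, x2, y1, y2 in rects if y1 > qy)
    # Two-sided misses: each is a dominance count on one corner of the rectangle.
    left_below = sum(1 for x1, x2, y1, y2 in rects if x2 < qx and y2 < qy)
    left_above = sum(1 for x1, x2, y1, y2 in rects if x2 < qx and y1 > qy)
    right_below = sum(1 for x1, x2, y1, y2 in rects if x1 > qx and y2 < qy)
    right_above = sum(1 for x1, x2, y1, y2 in rects if x1 > qx and y1 > qy)
    # A rectangle cannot be both left and right of q (or both below and above),
    # so inclusion-exclusion stops at pairs.
    misses = (left + right + below + above) - (
        left_below + left_above + right_below + right_above)
    return total - misses


rects = [(0, 4, 0, 4), (2, 6, 1, 3), (5, 8, 5, 9)]
print(stabbing_count(rects, 3, 2))
```

For the query point (3, 2) the first two rectangles are stabbed and the third lies entirely up and to the right, which the formula accounts for by subtracting it twice and adding it back once.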
Theorem 5. Given a relational event graph $G$ the problem of determining the number of edges
with at most $r$ past neighbors and at most $s$ future neighbors in $G_{i,j}$ can be reduced to dominance
counting in $O((r + s)m)$ time and linear space.

Proof. An edge $e_k$ has at most $r$ past neighbors and at most $s$ future neighbors in $G_{i,j}$ (and is
in $G_{i,j}$) precisely when $\pi_r(e_k) \le i \le k \le j \le \varphi_s(e_k)$. If we view $(i, j)$ as a point in $\mathbb{R}^2$, then
this happens when the rectangle $[\pi_r(e_k), k] \times [k, \varphi_s(e_k)]$ encloses the point $(i, j)$, which reduces to
dominance counting by Lemma 6. □

Corollary 1. Given a relational event graph $G$ the problem of determining the number of edges
with $r$ past neighbors and $s$ future neighbors, the number of isolated edges, and the number of edges
with $k$ neighbors in $G_{i,j}$ can be reduced to dominance counting in $O((r + s)m)$ time and linear space,
except for the number of edges with $k$ neighbors which takes $O(km)$ time and space.

Proof. For edges with past and future neighbors we use Theorem 5 to compute the four data
structures that compute $N_{\le r', \le s'}$ (the number of edges with past and future neighbors bounded by $r'$
and $s'$) for all combinations of $r' \in \{r, r - 1\}$ and $s' \in \{s, s - 1\}$. Then to compute the number of
edges with exactly $r$ past neighbors and $s$ future neighbors we use inclusion-exclusion:

$$N_{=r,=s} = N_{\le r, \le s} - N_{\le r, \le s-1} - N_{\le r-1, \le s} + N_{\le r-1, \le s-1}.$$

To count isolated edges we set $r = s = 0$. To count edges with exactly $k$ neighbors we sum over the
edges with $r$ past and $s$ future neighbors for all combinations of $r$ and $s$ satisfying $r + s = k$. □



7 Determining influence 

In this section we designate a fixed set of vertices as influential vertices and seek to find the number
of influenced vertices, where vertex $v$ is influenced if there is a path of index-increasing edges from
an influential vertex to $v$. Such a path will be called a path of influence. If we think of the edges
as communication events, then this models the flow of information from the influential vertices.
Motivated by the degradation of information as it is relayed we also consider the number of
$h$-influenced vertices, i.e., vertices that are on a path of influence with fewer than $h$ edges. This query
also does not appear to be an instance of the general matroid slice problem.
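The definitions of influenced and h-influenced vertices can be made concrete with a direct, per-query Python sketch. It is not the data structure of this section, only a reference computation that scans one slice. The function name `influenced_vertices` is hypothetical, edges are assumed to be directed (source, target) pairs, and a vertex is reported only when it is the endpoint of a nonempty path of influence.

```python
def influenced_vertices(edges, influential, i, j, h=None):
    """Vertices reached by a path of influence inside the slice i..j.

    If h is given, only paths with fewer than h edges are allowed.
    """
    # hops[x] = fewest edges on any path of influence found so far that ends
    # at x; influential vertices start with zero hops.
    hops = {s: 0 for s in influential}
    influenced = set()
    # Scanning in index order guarantees that edge indices increase along
    # every path that gets extended.
    for u, v in edges[i:j + 1]:
        if u not in hops:
            continue  # u carries no influence yet, so the edge relays nothing
        new_hops = hops[u] + 1
        if h is not None and new_hops >= h:
            continue  # the path would be too long
        influenced.add(v)
        if v not in hops or new_hops < hops[v]:
            hops[v] = new_hops
    return influenced


edges = [("s", "a"), ("a", "b"), ("b", "c"), ("s", "b")]
print(sorted(influenced_vertices(edges, {"s"}, 0, 3)))
print(sorted(influenced_vertices(edges, {"s"}, 1, 3)))
print(sorted(influenced_vertices(edges, {"s"}, 0, 3, h=3)))
```

Dropping the first edge from the slice removes a from the chain, so only b remains reachable through the final edge, and the bound h = 3 excludes c because its only path uses three edges.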

For each edge $e_k = (u_k, v_k)$ we define $\iota(e_k)$ to be the greatest $i$ such that $v_k$ is influenced
in $G_{i,k}$, and $\lambda(e_k)$ the least $j$ such that $v_k$ is influenced in $G_{k,j}$.

Lemma 7. The values of $\iota$ and $\lambda$ can be computed in linear time and space.

Proof. We consider the edges $e_k = (u_k, v_k)$ in sequence order, setting $\iota(e_k)$ and $\lambda(e_k)$ as we do.

The edge $e_k = (u_k, v_k)$ is on a path of influence only if $u_k$ is either an influential vertex or
an influenced vertex ($\iota(u_k)$ is set). If $u_k$ is an influential vertex, then $\iota(v_k) = k$. Otherwise, we
set $\iota(v_k)$ to $\iota(u_k)$ when $\iota(v_k) < \iota(u_k)$ or $\iota(v_k)$ is unset, and do nothing when $\iota(v_k) \ge \iota(u_k)$. The
computation of $\lambda$ is similar. □

Theorem 6. Given a relational event graph $G$ the problem of determining the number of influenced
vertices in the slice $G_{i,j}$ can be reduced to dominance counting in linear time and space.

Proof. We will count the number of influenced vertices in $G_{i,j}$ by counting the number of destination
vertices of edges in $G_{i,j}$ that are not influenced (counted with multiplicity) and then taking the
complement. A vertex $v$ cannot be influenced in $G_{i,j}$ unless there is an edge $e_k = (u_k, v_k)$ in $G_{i,j}$
with $v_k = v$, i.e., $i \le k \le j$. Now $v_k$ is not influenced in $G_{i,j}$ whenever $\iota(e_k) < i \le k \le j < \lambda(e_k)$,
i.e., when the point $(i, j)$ is in the rectangle $(\iota(e_k), k] \times [k, \lambda(e_k))$. Now that the problem is reduced
to rectangle stabbing we use Lemma 6. □

Theorem 7. Given a relational event graph and a predetermined value $h$ the problem of determining
the number of $h$-influenced vertices in the slice $G_{i,j}$ can be reduced to dominance queries in $O(hm)$
time and linear space.

Proof. We modify the argument in Lemma 7 and Theorem 6 to keep track of $\iota$ and $\lambda$ for $k$-influence
for each vertex and for each choice of $k \le h$ instead of just for influence. It takes $O(h)$ time per
edge to update these times of $k$-influence. □



8 Counting triad closure events 

Define a triad closure event in an undirected relational event graph to be an edge $e_k$ within a given
slice $G_{i,j}$ such that $e_k$ is the final edge of at least one triangle; that is, such that the other two edges
of the triangle also belong to the same slice but are earlier in the sequence of edges than $e_k$. To
count these events we define $\Delta(e_k)$ to be the smallest index $d$ such that $e_k$ does not belong to a
triangle in $G_{d,k}$. Then, the number of triadic closure events for slice $G_{i,j}$ is exactly the number of
edges $e_k$ satisfying $i < \Delta(e_k) \le k \le j$, something that can be counted with the same mapping to
$\mathbb{R}^2$ and dominance query as in Lemma 1. The difficulty, for this problem, is in the preprocessing: how
do we compute $\Delta(e_k)$ efficiently, for all edges $e_k$?

To solve this problem, we adapt a data structure of Eppstein and Spiro [13] for counting triangles
in a dynamic graph. This data structure is based on the concept of the h-index of the graph, the
largest number $h$ such that the graph contains at least $h$ vertices of degree at least $h$; all graphs
with $m$ edges satisfy $h = O(\sqrt{m})$. Eppstein and Spiro maintain a slowly-changing partition of the
graph vertices into two subsets $H$ and $L$, where $H$ contains $O(h)$ vertices and where every vertex
in $L$ has degree $O(h)$. We simplify this by computing the h-index of the aggregate graph and
partitioning its vertices into static subsets $H$ and $L$, where $|H| \le h$ and where every vertex in $L$
has degree at most $h$.
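
As a rough illustration of this static partition, the following Python sketch computes the h-index of the aggregate graph and splits its vertices into H and L. The function name `h_index_partition` and the representation of the event sequence as a list of vertex pairs are assumptions made for this example, not part of the structure described above; degrees are counted in the aggregate graph, with repeated edges collapsed and self-loops ignored.

```python
from collections import defaultdict


def h_index_partition(edges):
    """Return (h, H, L) for the aggregate graph of an edge sequence.

    edges: iterable of (u, v) pairs (hypothetical input format).
    H holds the vertices of degree greater than h, so |H| <= h;
    every vertex in L has degree at most h.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops in this sketch
            neighbors[u].add(v)
            neighbors[v].add(u)

    # h-index: largest h such that at least h vertices have degree >= h.
    degrees = sorted((len(nb) for nb in neighbors.values()), reverse=True)
    h = 0
    for rank, d in enumerate(degrees, start=1):
        if d >= rank:
            h = rank
        else:
            break

    H = {v for v, nb in neighbors.items() if len(nb) > h}
    L = set(neighbors) - H
    return h, H, L
```

Placing only the vertices of degree strictly greater than h into H is one way to guarantee both bounds at once: if more than h vertices had degree above h, the h-index would be at least h + 1.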

Next, we loop through the edges in sequence order, maintaining as we do two hash tables E and 
P indexed by pairs of vertices. The first of these two tables, E[u,v], stores the most recent edge 
with those two endpoints (if such an edge has already been encountered in the edge sequence). The 
second table, P[u, v], stores the two-edge path from u to v via a third node $w \in L$ that maximizes
the index of the earlier of the two edges (u,w) and (w, v), if such a path exists and G has an 
edge (u, v). We also maintain an adjacency list for each vertex, listing the vertices connected to it 
by edges that have already been encountered. 

From this information, we can compute $\lambda(e_k)$ in time $O(h)$: let u and v be the endpoints of $e_k$,
look up in P[u, v] the best path through a vertex in L, and find the best path through a vertex in
H by testing each of the at most h choices for this vertex, using E to test each choice in constant time.
Once $\lambda(e_k)$ has been computed, we may also update E and the adjacency lists in constant time. To update P,
for each endpoint v of $e_k$ that belongs to L, loop through each neighbor w of v, find the two-edge
path combining $e_k$ and E[v, w], and use this path to update P[u, w], where u is the other endpoint
of $e_k$. This update process takes constant time per neighbor, and there are at most h neighbors,
so again the time is $O(h)$.

Theorem 8. Given an undirected relational event graph $G$, the problem of determining the number
of triad closure events in the slice $G_{i,j}$ can be reduced to dominance counting in $O(hm)$ time and linear
space.

Proof. We perform the preprocessing steps described above to compute $\lambda(e_k)$ for each edge $e_k$, in
total time $O(hm)$, and then use the same persistent finger tree structure described in the matroid
rank data structure (Lemma 1), using $\lambda$ in place of the similar index $r(e_k)$ of the matroid rank
data structure. □
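
The preprocessing loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: edges are given as a list of vertex pairs with 1-based indices and no self-loops, the tables E and P store edge indices rather than edge objects (with 0 meaning "no such edge or path"), and the names `triad_closure_indices` and `count_triad_closures` are hypothetical. The second function counts by brute force and merely stands in for the dominance query.

```python
from collections import defaultdict


def triad_closure_indices(edges):
    """Return lam with lam[k] = lambda(e_k) for 1-based k (lam[0] is unused)."""
    def key(a, b):
        return (a, b) if a <= b else (b, a)

    # Static partition of the aggregate graph by its h-index.
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    degs = sorted((len(s) for s in nbrs.values()), reverse=True)
    h = sum(1 for rank, d in enumerate(degs, start=1) if d >= rank)
    H = [v for v, s in nbrs.items() if len(s) > h]
    L = set(nbrs) - set(H)
    graph_pairs = {key(u, v) for u, v in edges}

    E = {}                   # most recent edge index for each vertex pair
    P = {}                   # best (largest) earlier-edge index over paths via L
    adj = defaultdict(set)   # neighbors encountered so far
    lam = [None] * (len(edges) + 1)

    for k, (u, v) in enumerate(edges, start=1):
        best = P.get(key(u, v), 0)              # best path through L
        for w in H:                             # at most h candidates in H
            a = E.get(key(u, w), 0)
            b = E.get(key(w, v), 0)
            best = max(best, min(a, b))
        lam[k] = best + 1   # smallest d for which G_{d,k} has no triangle on e_k

        E[key(u, v)] = k
        adj[u].add(v)
        adj[v].add(u)
        # Update P for two-edge paths whose middle vertex x lies in L.
        for x, y in ((u, v), (v, u)):
            if x in L:
                for w in adj[x]:
                    if w != y and key(y, w) in graph_pairs:
                        old = P.get(key(y, w), 0)
                        P[key(y, w)] = max(old, E[key(x, w)])
    return lam


def count_triad_closures(lam, i, j):
    """Brute-force count of edges e_k with i < lambda(e_k) <= k <= j."""
    return sum(1 for k in range(i, j + 1) if i < lam[k])
```

For the sequence (1,2), (2,3), (1,3), (3,4), (1,2), the third and fifth edges each close a triangle, and the sketch yields lam[1:] == [1, 1, 2, 1, 3].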



9 Conclusions 

We have described data structures for many counting problems on slices of relational event data. 
Our analysis separates preprocessing from queries, but many of our data structures preprocess the 
data in sequence order, allowing queries to be interleaved with the addition of new data to the end 
of the sequence. 

Many interesting social network parameters remain to be addressed, including the clustering 
coefficient, the h-index, the number of vertices reachable via non-monotonic paths, and the size
of the largest connected component. In addition, several of the parameters for the statistics we 
compute (such as the hop count and influential vertices in our influence-counting structure) must 
be determined at preprocessing time, and it would be of interest to develop more flexible structures 
that can delay the choice of these parameters until query time. Thus, although we have shown 
many interesting graph statistics to be computable efficiently in our model, much more remains to 
be done.



A Window-sensitive dominance counting



Suppose we are given as input a set $S$ of $n$ points, with integer coordinates in the range from 1 to
$n$; we wish to answer dominance counting queries, where a query specifies a point $(x, y)$ and must
count the number of points $(x', y') \in S$ with $x' < x$ and $y' < y$. JaJa, Mortensen and Shi [24]
provide a data structure for this problem, in the word RAM computation model, that uses linear
space and achieves $O(\log n/\log\log n)$ query time. More precisely, they show (in their Lemma 5)
that, in a model of computation in which each word contains at least $\log n$ bits of information and in
which tables of size $n$ may be precomputed, it is possible to represent sets of $m$ points in space
$O(m)$ and achieve query time $O(\log m/\log\log n)$. (Gupta et al. make an additional assumption,
that the points of their data set have distinct coordinates, but this can be achieved with no loss
of generality and with no change to their space or query time bounds by sorting the points by
their coordinate values and replacing the coordinates by indices into the sorted order.) Applying
this structure directly to the point sets generated from our reductions would give us query time
$O(\log m/\log\log m)$ and space $O(m)$, where $m$ is the number of edges in the given relational event
graph. Instead, we show that it is possible to achieve slightly faster query time, $O(\log w/\log\log m)$,
where $w$ is the number of edges in the query slice.

The key observations needed for this improvement are the following: 

• All of the points $(x_i, y_i)$ in the point sets generated by our reductions satisfy $x_i > y_i$; that
is, they lie below the main diagonal $x = y$ of the $n \times n$ square forming the bounding box of
the points. In the matroid rank problems, we may interpret $x_i$ as being the index of each
edge, and $y_i$ as being the number $r(e_i)$, which is always less than the index itself; similar
observations apply to the other problems.

• Each query on a slice $G_{i,j}$ is translated to dominance queries determined by the point $(i, j)$.
The number of edges in the slice, $j - i + 1$, is proportional to the geometric distance $(j - i)/\sqrt{2}$
of this point from the main diagonal.

We may assume without loss of generality that the quadrant in which we wish to count points 
for a query is the quadrant $\{(x, y) \mid x < i \wedge y > j\}$ that extends from the query point towards
the main diagonal. It is not true that these are the only quadrants produced by our reductions 
from graph slice problems to dominance counting; however, the number of points in each of the 
other three quadrants may be easily computed by combining the number of points in this quadrant 
with halfspace range counting problems. The number of points in an axis-aligned halfspace can be 
determined trivially in linear space and constant time per query by precomputing the answer to 
each possible query halfspace. Thus, it remains to show that, given any set of points below the 
main diagonal of the square, we can answer dominance counting problems for quadrants that point 
towards the main diagonal, in an amount of time per query that is a function of the distance from 
the diagonal. 

To solve dominance counting problems on a given set of points satisfying these assumptions,
we partition the points into subsets, where subset $S_i$ contains the points whose distance from the
main diagonal is at most $(\log n)^{2^i}$ and which are not in any set $S_{i'}$ for $i' < i$. Then, in outline,
we use a local coordinate system for each subset $S_i$ in which the number of distinct coordinates
is proportional to the number of points in $S_i$ (allowing the data structure of JaJa et al. to be
used in a space-efficient way), and we cover each subset $S_i$ by data structures that each serve a
range of $O((\log n)^{2^i})$ coordinates, in such a way that each point is covered by at most two data
structures; again, this achieves linear space, while allowing a query within $S_i$ to be performed
quickly. Finally, we use fractional cascading to link each subset $S_i$ to the next subset $S_{i+1}$,
allowing the transformation into the local coordinate systems to be performed quickly and allowing
us to quickly find the subset $S_i$ in which it is most appropriate to perform the query, reducing all
lower-level queries to constant-time halfspace counting queries.
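
To make the doubly exponential distance thresholds concrete, the following Python sketch assigns a point below the diagonal to its subset index. It is only an illustration under several assumptions that are not fixed by the text: logarithms are taken base 2, the distance from the diagonal is measured as x - y (which differs from the geometric distance only by the constant factor $\sqrt{2}$), the base is clamped to at least 2 so that squaring makes progress for very small n, and the name `subset_index` is hypothetical.

```python
import math


def subset_index(x, y, n):
    """Smallest i such that x - y <= (log n)^(2^i), for a point with x >= y."""
    d = x - y
    bound = max(2.0, math.log2(n))  # (log n)^(2^0)
    i = 0
    while d > bound:
        bound *= bound              # (log n)^(2^(i+1)) is the square of (log n)^(2^i)
        i += 1
    return i
```

With n = 1024 the thresholds are 10, 100, 10000, and so on, so only about log log n levels are needed to cover every possible distance.
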
In more detail, we store the following for each subset $S_i$:

• Lists of the points in $S_i$, sorted both by their x-coordinates and by their y-coordinates.

• For each point in $S_i$, its indices in both sorted lists, allowing us to answer in constant time a
halfspace counting query with the coordinate of that point.

• Two lists $X_i$ and $Y_i$, consisting both of points in $S_i$ and of some points in $S_j$ for $j > i$, sorted
by their x-coordinates and y-coordinates respectively. $X_i$ consists of $S_i$ together with the
elements at even positions in $X_{i+1}$, and similarly $Y_i$ consists of $S_i$ together with the elements
at even positions in $Y_{i+1}$. Each entry in $X_i$ or $Y_i$ contains pointers to the nearest point in the
sorted list for $S_i$ and to the nearest point in $X_{i+1}$ or $Y_{i+1}$. In this way, starting from $S_0$, we
can navigate from $S_i$ to $S_{i+1}$ in constant time.

• For each point in $S_i$, a translation of its coordinates into the local coordinate system of $S_i$,
obtained by compressing out coordinate values that occur neither as the x-coordinate nor as
the y-coordinate of any point in $S_i$. In this compressed coordinate system, all points remain
below the main diagonal, and the number of distinct coordinates is at most equal to the
number of points.

• A sequence of the data structures of JaJa et al., each covering (for some integer $k$) the subset
of points in $S_i$ whose local coordinates have $y > k(\log n)^{2^i}$ and $x < (k + 2)(\log n)^{2^i}$. Thus,
there are at most $2(\log n)^{2^i}$ distinct x- and y-coordinates within one of these structures, so
their query time is

$$O\left(\frac{\log\bigl((\log n)^{2^i}\bigr)}{\log\log n}\right) = O(2^i).$$

Any query defined by a point whose two coordinates differ by at most $(\log n)^{2^i}$ may be handled by
one of these structures, determined in constant time by dividing the query coordinates by $(\log n)^{2^i}$.
Each point of $S_i$ belongs to two of these structures, so the total space for all of these structures is
$O(|S_i|)$.

In addition, we store an array indexed by coordinate, mapping coordinates in the coordinate space
of the whole point set to their positions in lists $X_0$ and $Y_0$.

Theorem 9. Given a set of $O(n)$ points below the main diagonal in an $n \times n$ integer grid, we can
process them into a data structure of size $O(n)$ that handles dominance queries for which the query
point is at distance $d$ from the main diagonal in time $O(\log d/\log\log n)$ per query.

Proof. All of the data structures described above take space $O(|S_i|)$ for each set $S_i$, so the total
space is linear.

To answer a query, we start in $S_0$. Within each set $S_i$ for which the query quadrant extends
beyond the distance of the set from the main diagonal and therefore could also contain points of
$S_{i+1}$, we translate the query into two halfspace queries, answer these queries in constant time, and
use the $X_i$ and $Y_i$ structure to progress to the next set $S_{i+1}$ in constant time. In the final set $S_i$, we
translate the query into the local coordinate system and then use one of the data structures of JaJa
et al. stored for this set to answer the query directly in time $O(2^i)$. This $O(2^i)$ time dominates the
query (everything else is $O(i)$) and thus the time per query is $O(2^i) = O(\log d/\log\log n)$. □

Figure 3: 24-leaf binary tree formed from the zeroless binary representation $24_{10} = 2112_2$. Each
of the four shaded complete binary subtrees corresponds to one of the four digits of the binary
representation, in left-to-right order.



When translated to our relational event graph problems, this gives query time bounds of the
form $O(\log(j - i)/\log\log m)$ for querying slice $G_{i,j}$ of a relational event graph with $m$ edges.

We observe that the same improvement may also be applied to the one-dimensional colored 
range counting problem considered by Gupta et al. [22]: as in our results, Gupta et al. transform 
the given input into a range counting problem on a set of two-dimensional points below the main 
diagonal. They use three-sided range queries rather than dominance counting, but their queries may 
be replaced by a linear combination of two axis-aligned halfspace queries and a dominance query. 
And, as in our problems, the length of the query interval for colored range counting translates into 
the distance of the dominance query point from the main diagonal. 



B Simplified dominance counting 

Our data structure for range searching uses fractional cascading layered on top of multiple copies 
of the structure of JaJa, Mortensen and Shi [24], which itself is quite complex and in turn relies
on the fusion trees of Fredman and Willard [17] , which are also complex. Therefore, although it 
achieves a good asymptotic space and query time complexity, we do not expect this combination 
of methods to be easy to implement. In this section we outline an alternative data structure for 
the same dominance counting problems that we expect to be more practical, although its time and 
space bounds are larger and we have not tested its practicality. Additionally, compared to the data 
structure in the previous appendix, the structure we define in this appendix has the theoretical 
advantage that it can handle queries with weighted points (dominance sum queries) and not just 
queries with unweighted points (dominance counting queries). 

In outline, our data structure for this problem uses path-copying persistence [10] applied to a
form of balanced binary tree, optimized for queries on small slices. The specific trees we use are
based on the observation that every positive integer has a unique representation as a base-2 number
in which each digit is either 1 or 2 (rather than the more traditional binary notation in which each
digit is either 0 or 1).² For instance,

$$24_{10} = 2112_2 = 2 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 2 \times 2^0.$$

Based on this fact, for every $n$ we can form a tree $T_n$ with exactly $n$ leaves and $n - 1$ internal
vertices: we represent $n$ as $n = \sum_{i=0}^{k} b_i 2^i$, where each $b_i \in \{1, 2\}$ and $k + 1$ is the number of digits in
the representation of $n$. We form a tree starting from a path of $k$ nodes, extending leftwards from
the root; the right child of the node in this path at distance $i$ from the root is a complete binary
tree with $b_i 2^i$ leaves, and the left child of the last node in this path (at distance $k - 1$ from the
root) is a complete binary tree with $b_k 2^k$ leaves. Figure 3 illustrates this construction for $n = 24$.

² For an analogous representation of positive integers in base 10 using digits with values from 1 to 10, without a
zero digit, see Foster [15].

In $T_n$, the path from the root to the $i$th leaf (in the left-to-right ordering of the leaves) has length
$O(\log(n - i))$: it takes at most $\log_2(n - i)$ steps to reach the complete binary subtree containing
the $i$th leaf, and another $\log_2(n - i) + O(1)$ steps to reach the leaf from the root of this subtree. In
addition, the structural change needed to form $T_{n+1}$ from $T_n$ is small: the binary representation of
$n + 1$ may be obtained from the representation of $n$ by changing trailing 2's to 1's and incrementing
the lowest order digit that is not a 2, and each of these operations corresponds to $O(1)$ changes to
the structure of the tree. So, in the worst case, $T_n$ and $T_{n+1}$ differ in the connections of $O(\log n)$
of their nodes, and the average change per step in constructing $T_n$ from $T_1$ by a sequence of these
increment steps is $O(1)$. In particular, $T_n$ can be constructed in time $O(n)$.
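
The zeroless representation itself is easy to compute. The short Python sketch below (the function names `zeroless_binary` and `subtree_leaf_counts` are illustrative only) extracts the digits from the least significant end and lists the leaf counts $b_i 2^i$ of the complete subtrees hanging off the leftward path of $T_n$, in left-to-right order.

```python
def zeroless_binary(n):
    """Digits (most significant first) of n in base 2 using only digits 1 and 2."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    digits = []
    while n > 0:
        d = 2 if n % 2 == 0 else 1   # the digit must have the same parity as n
        digits.append(d)
        n = (n - d) // 2
    return digits[::-1]


def subtree_leaf_counts(n):
    """Leaf counts b_i * 2**i of the complete subtrees of T_n, left to right."""
    digits = zeroless_binary(n)
    k = len(digits) - 1
    return [b * 2 ** (k - pos) for pos, b in enumerate(digits)]


print(zeroless_binary(24))      # [2, 1, 1, 2]
print(subtree_leaf_counts(24))  # [16, 4, 2, 2]
```

The four leaf counts for n = 24 sum to 24 and match the four shaded subtrees described in the caption of Figure 3.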

We now describe how to use these trees to solve dominance range sum queries. We assume we
are given as input a set of $n$ points $(x_i, y_i)$, each with a weight $w_i$. As in the previous section, we
assume that $0 \le y_i \le x_i < n$, so all points are on or below the main diagonal of the $n \times n$ integer
grid. We wish to handle queries that are given as arguments a pair of coordinates $(x, y)$ and that
return the query value

$$Q(x, y) = \sum \{ w_i \mid x_i < x \wedge y_i > y \}.$$

That is, we sum the weights of the points in the quadrant of the plane directed towards the main
diagonal from the query point.

To do so, for each value of $x$ in the range from 1 to $n - 1$ we store a tree $T_x$ with the structure
described above, with exactly $x$ leaves. We represent each interior node of this tree as an object x,
with four instance variables: a weight x.w, a count x.c, and left and right child pointers x.l and x.r.
We do not explicitly represent the leaf nodes of the tree, but they are useful for defining its structure.
The count variable for each node stores the number of leaves in the right subtree beneath that node;
it is zero for nodes that are themselves leaves. Although leaves are not represented explicitly within
our structure, we nevertheless define the weight of the leaf in position $y$ (in the left to right order
of the leaves, starting from position 0 for the leftmost leaf) to be

$$\sum \{ w_i \mid x_i < x \wedge y_i = y \}.$$

That is, it is the sum of weights of points within row $y$ of the $n \times n$ grid, up to column $x$. The
weight of a node that is not a leaf is the sum of the weights of the leaves in its right subtree.

To save space, we make the trees $T_x$ for different values of $x$ share as much of their structure as
they can. In particular, if trees $T_x$ and $T_{x+1}$ both contain nodes whose descendants form isomorphic
subtrees, with leaves in the same positions in the left-to-right order and with the same leaf weights,
then our data structure reuses the same node object for both of them. However, the parts of $T_x$
and $T_{x+1}$ that differ either structurally or in the weights stored in those nodes are represented in
the data structure by separate nodes. Finally, for each $x$ we store a pointer to the root of $T_x$ and
we store a number $W_x$, the total weight of all the leaves in $T_x$. Figure 4 illustrates this structure
of shared trees, weights, and counts for a set of points within a $5 \times 5$ grid.

To answer a query $(x, y)$, we perform a binary search for $y$ in tree $T_x$, using the count values
stored in each tree node to guide whether to step leftwards or rightwards at each point of the search.
The query value is then the sum of the weight values of the nodes at which this search stepped
leftwards. As a special case, the query $(x, -1)$ (which asks for the sum of weights of all points with
$x_i < x$) is handled by returning $W_x$. Thus, each query can be answered in time $O(\log(x - y))$.

Figure 4: The trees $T_n$ for $0 < n < 5$ derived from a $5 \times 5$ grid, and the pairs x.w, x.c stored
with each internal node of each tree. The leaf nodes are shown in the figure but not explicitly
represented.

To construct $T_x$ from $T_{x-1}$, we perform the structural rearrangements needed to form $T_x$ (creating
new node objects for the root of each subtree in $T_x$ that does not also appear as a subtree in
$T_{x-1}$). Then, for each point $(x_i, y_i)$ with $x_i = x - 1$, we add $w_i$ to the weight value of leaf $y_i$ in $T_x$,
and update the cumulative weights stored at each ancestor of this leaf, creating new copies of each
ancestor node in order to be able to store these updated weight values in $T_x$ without disturbing the
values already computed for $T_{x-1}$. The total number of new nodes that need to be created in this
step for each point $(x_i, y_i)$ is $O(\log(x_i - y_i))$. Thus, the total number of new nodes needed to create
the entire structure, which gives the space requirement for the structure as well as its construction
time, is $O(n + \sum_i \log(x_i - y_i))$.

We have proved the following result: 

Theorem 10. Suppose we are given $m$ weighted points below the main diagonal in an $n \times n$ grid.
Then in time $O(n + m \log n)$ we may preprocess these points into a data structure of size $O(n +
m \log n)$ that supports dominance sum queries, given by a query point $(x, y)$, in time $O(\log(x - y))$
per query.
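
The path-copying idea can be illustrated with a short Python sketch. To keep it brief, the sketch deliberately swaps in an ordinary balanced persistent segment tree over the rows 0 to n - 1 in place of the skewed trees $T_x$, so it answers queries in $O(\log n)$ time rather than the $O(\log(x - y))$ time of Theorem 10; it is meant only to show how one version per value of x shares nodes with the previous version. The names `Node`, `build`, and `query` and the (x, y, w) tuple format are assumptions of this example.

```python
class Node:
    """Immutable-by-convention tree node; total is the weight in its range."""
    __slots__ = ("total", "left", "right")

    def __init__(self, total=0, left=None, right=None):
        self.total, self.left, self.right = total, left, right


def _insert(node, lo, hi, y, w):
    """Return a new root for rows [lo, hi) with w added at row y (path copying)."""
    total = (node.total if node else 0) + w
    if hi - lo == 1:
        return Node(total)
    mid = (lo + hi) // 2
    left = node.left if node else None
    right = node.right if node else None
    if y < mid:
        left = _insert(left, lo, mid, y, w)     # right subtree is shared
    else:
        right = _insert(right, mid, hi, y, w)   # left subtree is shared
    return Node(total, left, right)


def _sum_above(node, lo, hi, y):
    """Sum of weights stored at rows strictly greater than y within [lo, hi)."""
    if node is None or y >= hi - 1:
        return 0
    if y < lo:
        return node.total
    mid = (lo + hi) // 2
    return _sum_above(node.left, lo, mid, y) + _sum_above(node.right, mid, hi, y)


def build(points, n):
    """points: (x, y, w) with 0 <= y <= x < n. roots[x] holds points with x_i < x."""
    by_x = [[] for _ in range(n)]
    for x, y, w in points:
        by_x[x].append((y, w))
    roots = [None]
    for x in range(n):
        root = roots[-1]
        for y, w in by_x[x]:
            root = _insert(root, 0, n, y, w)
        roots.append(root)
    return roots


def query(roots, n, x, y):
    """Q(x, y): total weight of points with x_i < x and y_i > y."""
    return _sum_above(roots[x], 0, n, y)
```

Each inserted point copies only the nodes on one root-to-leaf path, so the total space is proportional to n plus the number of points times the tree height, mirroring the accounting in the construction above.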

C Lower bounds 

Strengthening earlier results of Chazelle [6], Pătraşcu provided lower bounds for two-dimensional
range counting [33] that we adapt to our windowed relational event problems. Specifically, he
showed that, for $n$ given points in the Euclidean plane, it is hard to answer dominance queries,
asking for the number of given points $(x_i, y_i)$ with $x_i \le X$ and $y_i \le Y$ for some query pair $(X, Y)$.
In the cell probe model of Fredman and Saks [16], with $O(\log n)$ bits per machine word, every data
structure that can answer such queries using space $O(n \log^{O(1)} n)$ requires $\Omega(\log n/\log\log n)$
time per query.

We adapt this lower bound to our windowed relational event problems, by showing how to
translate a given set of $n$ Euclidean points into a synthetic relational event data set in such a way
that windowed queries into this data set simulate range counting queries. To do so, we construct
a (static) 1-regular graph with $2n$ vertices and $n$ isolated edges $e_i$. We may assume without loss
of generality (by perturbing the Euclidean points if necessary) that no two points have the same
x- or y-coordinate as each other; since only the ordering of the points by their coordinates matters
for handling dominance queries, we may also assume (as Pătraşcu does) that their coordinates are
all integers in the range from 0 to $n - 1$. That is, we are assuming that there exists a permutation
$\pi$ of the integers from 0 to $n - 1$ such that the points in the given set of points all have coordinates
of the form $(i, \pi(i))$.

Figure 5: Lower bound example for influenced vertices. The red vertex is influential; each remaining
vertex is influenced only when the path leading to it is included in the slice.

Given a point set in this form, we define a relational event data set with $2n$ events, where for
each $i$ in the range from 0 to $n - 1$, we include two copies of edge $e_i$ in the data set, one at time
$n - i - 1$ and a second copy at time $n + \pi(i)$. In this way, the number of given points dominated
by the query pair $(X, Y)$ will exactly equal the number of repeated edges in the slice $G_{n-X-1,\,n+Y}$.
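
The translation can be checked on small inputs with a brute-force Python sketch. The helper names below are hypothetical, events are stored simply as the index i of the edge $e_i$ occurring at each time step, dominance is taken to be non-strict ($i \le X$ and $\pi(i) \le Y$), and a repeated edge is assumed to mean an edge whose second copy also falls inside the slice.

```python
def events_from_permutation(pi):
    """Event list of length 2n: events[t] is the index i of the edge e_i at time t."""
    n = len(pi)
    events = [None] * (2 * n)
    for i, p in enumerate(pi):
        events[n - i - 1] = i   # first copy of e_i
        events[n + p] = i       # second copy of e_i
    return events


def repeated_edges_in_slice(events, lo, hi):
    """Number of edges occurring more than once among times lo..hi (inclusive)."""
    seen, repeated = set(), 0
    for t in range(lo, hi + 1):
        if events[t] in seen:
            repeated += 1
        seen.add(events[t])
    return repeated


def dominated(pi, X, Y):
    """Number of points (i, pi[i]) with i <= X and pi[i] <= Y."""
    return sum(1 for i, p in enumerate(pi) if i <= X and p <= Y)
```

For any permutation `pi` of 0 to n - 1 and any query pair with 0 <= X, Y <= n - 1, `repeated_edges_in_slice(events, n - X - 1, n + Y)` agrees with `dominated(pi, X, Y)`, because the first copy of $e_i$ lies in the slice exactly when $i \le X$ and the second copy exactly when $\pi(i) \le Y$.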

Theorem 11. For each of the problems of counting components, counting loopy components, counting
isolated vertices, counting isolated edges, and counting repeated edges, any data structure for a
relational event graph with $m$ edges and space $O(m \log^{O(1)} m)$ requires $\Omega(\log m/\log\log m)$ time per
query.

Proof. For the data set produced by our translation, the answer to any one of these queries can be 
combined with the (trivially calculated) number of edges in a slice to give the number of repeated 
edges within a slice. Using the translation described above, this could then be used to answer 
two-dimensional range counting queries in the same asymptotic query bound. Since range counting 
queries cannot be answered more quickly than the query time stated in the theorem, neither can 
these graph queries. □ 

A similar construction using a different relational event data set in the form of a tree of height
two, using two-edge paths in place of the pairs of equal edges, shows that the same lower bound
also holds for counting influenced vertices. Figure 5 shows the construction: for each pair of edges
$(e_i, f_i)$, the endpoint of $e_i$ will be influenced whenever $e_i$ lies in the query interval, but the endpoint
of $f_i$ will be influenced only if $e_i$ and $f_i$ both lie in the query interval. Thus, as above, we
can translate a two-dimensional range counting instance (represented as a permutation $\pi$ of the
numbers from 0 to $n - 1$) by including edge $e_i$ at time $n - i$ in a relational event data set and by
including edge $f_i$ at time $n + \pi(i)$. We set the root of the tree as the sole influential vertex.

Theorem 12. Any data structure for counting influenced vertices in a relational event graph with
$m$ edges and space $O(m \log^{O(1)} m)$ requires $\Omega(\log m/\log\log m)$ time per query.

Proof. We use the translation described above. The answer to a dominance counting query for
the query point $(X, Y)$ is given by $c - X$, where $c$ is the number of influenced vertices in the slice
$G_{n-X-1,\,n+Y}$. □